ABSTRACT

We present a work-in-progress snapshot of learning with a 15 billion 
parameter deep learning network on HPC architectures applied to the 
largest publicly available natural image and video dataset released 
to-date. Recent advancements in unsupervised deep neural networks 
suggest that scaling up such networks in both model and training 
dataset size can yield significant improvements in the learning of 
concepts at the highest layers. We train our three-layer deep neural 
network on the Yahoo! Flickr Creative Commons 100M dataset. The
dataset comprises approximately 99.2 million images and 800,000 
user-created videos from Yahoo’s Flickr image and video sharing 
platform. Training of our network takes eight days on 98 GPU nodes 
at the High Performance Computing Center at Lawrence Livermore 
National Laboratory. Encouraging preliminary results and future
research directions are presented and discussed.

Index Terms: Deep Learning, Autoencoders, High Performance
Computing

1. INTRODUCTION 

The field of deep learning via stacked neural networks has received 
renewed interest in the last decade Gill 121. Neural networks have 
been shown to perform well in a wide variety of tasks, including 
text analysis (41, speech recognition Eli in, various classification 
tasks Em. and most notably unsupervised and supervised feature 
learning on natural imagery diEiia. 

Deep neural networks applied to natural images have demonstrated
state-of-the-art performance in supervised object recognition
tasks (TOl [T1 as well as unsupervised neural networks EE. The 
classical approach to training neural networks for computer vision is 
via a large dataset of labeled data. However, sufficiently large and 
accurately labeled data is difficult and expensive to acquire. Motivated
by this, (3 explored the application of deep neural networks
in unsupervised deep learning and discovered that sufficiently large 
deep networks are capable of learning highly complex concept level 
features at the top level without labels. 

Spurred by this advancement, (J) set out to construct very large 
networks on the order of 10^ to 10^° parameters. A key advancement
was the highly efficient multi-GPU architecture of their model.
(3 employed a high degree of model parallelism and was able to
process 10 million YouTube thumbnails in a few days of processing
time on a medium-sized cluster. A notable result was the unsupervised
learning of various faces, including those of humans and cats.
Ultimately, improved feature learning at larger scales can improve 
downstream capabilities such as scene or object classification, additional
unsupervised learning (i.e. via topic modeling m or natural
language processing algorithms iTTl).

In collaboration with the authors of E, we have scaled a similar
model and architecture to over 15 billion parameters on the
Lawrence Livermore National Laboratory’s (LLNL) Edge High 
Performance Computing (HPC) system. Our long-term goal is 
two-fold: (1) explore at-the-limit performance of massive networks 
(> 10 billion parameters) and (2) train on and analyze datasets on 
the order of 100 million images. 

As the number of network parameters grows, datasets need to
be scaled accordingly to avoid overfitting the models. We take advantage
of a brand-new dataset released jointly by Yahoo!, LLNL
and the International Computer Science Institute (ICSI) called the
Yahoo! Flickr Creative Commons 100M (YFCC100M) datasetlO.
The dataset is, to the authors’ knowledge, the largest single publicly
available image and video dataset ever published. In addition to the
raw images and video, the YFCC100M also contains metadata for
each entry including locations, camera types, keywords, titles, etc.
Although beyond the scope of this paper, this rich associated metadata
potentially offers researchers additional avenues of semantic
multi-modality learning to explore.

Working with the large-scale datasets, models and computing
architectures considered in this paper presents several daunting engineering
challenges. For example, the significantly greater number
of GPUs and compute nodes used in our system versus E creates
communication issues in MPI. In addition, a typical model takes up
over 40 GB of memory, making simple offline analysis tasks such
as visualization challenging. Various network architectures were
tested, balancing performance and computational constraints, before
we arrived at our current model. Finally, as in (2, data throughput
presents a bottleneck to model training. We present a novel pipeline
approach to address this problem.

The rest of this paper is organized as follows. In Section 2, we
give a brief overview of the YFCC100M dataset. The network architecture
and computational framework being employed are described
in Section 3. We present preliminary results and visualizations of
our network in Section 4. Finally, we summarize and discuss future
research directions in Section 5.

2. OVERVIEW OF THE YFCC100M DATASET

In late June 2014, Yahoo! released the Yahoo! Flickr Creative
Commons dataset (YFCC100M). This dataset consists of 100 million
Flickr user-uploaded images and videos (99,206,564 images and
793,436 videos) along with their corresponding metadata including
title, description, camera type, tags, and geotags when available. All
of the data is under Creative Commons licensing and is freely provided
to scientists for the advancement of multimedia research.^1 In
addition to the raw images, videos, and metadata, Yahoo! in collaboration
with the ICSI and LLNL will be computing and providing
standard computer vision and audio features using LLNL’s supercomputing
resources.

Wang et al. lITSl have used YFCC100M data to build systems
that associate images with more natural annotations like those
found in user-generated captions. Others are interested in using the
YFCC100M imagery and audio to geolocate where the photo or
video was taken lfT4l . In fact, the 2014 MediaEval Placing Task is
using YFCC100M as the source of benchmark data na. We are
interested in using YFCC100M as our sandbox dataset for learning
image features using massive unsupervised neural networks, repeating
the experiment by (S) on an order of magnitude more data and
neural network parameters. In particular, we want to see what other
“grandmother neurons” O our network would automatically learn
from YFCC100M.


Table 1. Top 60 Tags in YFCC100M Images
The 99,206,564 images were created and posted by 578,268 different
Flickr users. 76%, 20%, and 4% of the images have titles,
auto-titles, or no titles, respectively. The average number of words
per title is 3.08. 32% of the images have descriptions with an average
of 22.52 words per description. Finally, 69% of the images have
on average 7.07 tags per image. The top 60 tags are shown in Table 1.
In Fig. 1, we show example images and associated metadata
for several YFCC100M images.
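
Title and tag statistics of this kind can be gathered in a single pass over the metadata. The following is a minimal sketch, assuming a hypothetical record layout (a dict with a `title` string and a `tags` list) that is not taken from the dataset documentation; the toy records are purely illustrative and are not YFCC100M entries.

```python
from collections import Counter

def summarize_metadata(records, top_k=3):
    """Compute simple title/tag statistics over an iterable of records.

    Each record is assumed (hypothetically) to be a dict with a "title"
    string and a "tags" list; the real metadata layout may differ.
    """
    n = titled = title_words = tagged = tag_total = 0
    tag_counts = Counter()
    for rec in records:
        n += 1
        title = rec.get("title", "").strip()
        if title:
            titled += 1
            title_words += len(title.split())
        tags = rec.get("tags", [])
        if tags:
            tagged += 1
            tag_total += len(tags)
            tag_counts.update(tags)
    return {
        "frac_titled": titled / n if n else 0.0,
        "mean_title_words": title_words / titled if titled else 0.0,
        "frac_tagged": tagged / n if n else 0.0,
        "mean_tags_per_tagged": tag_total / tagged if tagged else 0.0,
        "top_tags": tag_counts.most_common(top_k),
    }

# Toy records, invented for illustration only.
toy = [
    {"title": "Golden Gate at dusk", "tags": ["california", "bridge"]},
    {"title": "", "tags": ["california"]},
    {"title": "IMG 0042", "tags": []},
    {"title": "Tower", "tags": ["london", "bridge", "travel"]},
]
stats = summarize_metadata(toy)
```

Streaming over the records keeps memory use proportional to the number of distinct tags rather than the number of images, which matters at the scale of roughly 100 million entries.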

3. ANALYSIS WITH LARGE SCALE NEURAL 
NETWORKS 

3.1. Network Architecture 

For the large set of image data, we employed a three-layer, large-scale
deep neural network with a reconstruction independent component
analysis (RICA) cost function:

$$
\min_{W,\,\alpha,\,b} \; \sum_i \left\| W^T\left(\alpha W x^{(i)}\right) + b - x^{(i)} \right\|_2^2 + \lambda\, S\left(\alpha W x^{(i)}\right)
\quad \text{subject to} \quad \left\| W^{(k)} \right\|_2 = 1, \;\forall k,
$$

where, as in prior work, $W$ is a weighting matrix, $\alpha$ is a scaling value and
$x^{(i)}$ are the data points at the beginning of each layer. In addition, we
introduce an offset, $b$, for increased model flexibility. The parameter
$\lambda$ weights the sparseness term $S$ and thus controls the relative sparsity;
it is set to 0.1 at the first two layers and 0.01 at the final layer.
Unlike lO, we do not presently include a
pooling layer, as we believe the scale of the network and training data
allows a similar translational invariance to be automatically learned.
A particular advantage conferred by the RICA construction is that
the sparseness term can be computed in-situ with the rest of the model
parameters. This is in contrast to the conventional sparse autoencoder
construction that requires a second pass through the data to compute
a sparseness-specific gradient contribution.

Fig. 2 illustrates the structure of our network. The three layers
are composed of two untied convolutional layers, and a third fully-
connected layer. The first convolutional layer utilizes 5184 filters^2
of input size 16 x 16 x 3 with stride 4 and output size 4 x 4 x 24.
The second layer takes 16 spatially contiguous^3 4 x 4 x 24 outputs
of the first layer and connects them fully to a 4 x 4 x 24 output.
The stride length of the second layer is 4. The third layer is dense,
and fully connects the 62 x 62 x 24 outputs of the second layer to
4096 top-level neurons. The total number of parameters trained is
15 billion. After each layer, local contrast normalization (LCN) is
applied prior to continuing onto the next layer. Though no pooling
is applied, the window sizes at the next layer are large enough to
incorporate spatial information from neighboring blocks.

[Figure 2 panels: Layer 1: Untied Convolutional; Layer 2: Untied Convolutional; Layer 3: Fully Connected; with Local Contrast Normalization (LCN) stages]

Fig. 2. Network topology of large scale, trained network. Approximately
15 billion parameters.

Fig. 3. Pipeline for semi-parallel training of sparse autoencoders
from a single data source.

^1 Available at http://research.yahoo.com/Academic_Relations
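
As an illustration of the RICA cost described in Section 3.1, the sketch below evaluates the objective with NumPy. It is not the training code used for this work: the smooth L1 penalty `sqrt(z^2 + eps)` standing in for the sparseness term, the function names, the choice to sum both terms over all examples, and the row normalization used to satisfy the unit-norm constraint are all assumptions made for the example.

```python
import numpy as np

def smooth_l1(z, eps=1e-8):
    # Smooth surrogate for |z|; an assumed form of the sparseness term.
    return np.sqrt(z * z + eps)

def normalize_rows(W):
    # Enforce the constraint ||W^(k)||_2 = 1 for every row k of W.
    return W / np.linalg.norm(W, axis=1, keepdims=True)

def rica_objective(W, alpha, b, X, lam):
    """RICA cost for data X of shape (n_inputs, n_examples).

    W: (n_features, n_inputs) with unit-norm rows, alpha: scalar scale,
    b: (n_inputs, 1) offset, lam: weight on the sparseness term.
    """
    H = alpha * (W @ X)                    # activations alpha * W x^(i)
    R = W.T @ H + b - X                    # reconstruction residual
    recon = np.sum(R * R)                  # squared L2 norm, summed over i
    sparse = lam * np.sum(smooth_l1(H))    # sparseness term, summed over i
    return recon + sparse

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))            # 5 toy examples of dimension 8
W = normalize_rows(rng.standard_normal((4, 8)))
b = np.zeros((8, 1))
cost = rica_objective(W, alpha=1.0, b=b, X=X, lam=0.1)
```

Because the sparseness term is a direct function of the same activations `H` used for reconstruction, both terms come out of one forward computation, which is the in-situ property noted in the text.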


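
The layer dimensions quoted in Section 3.1 can be cross-checked with a little arithmetic. The sketch below assumes no padding and assumes that each first-layer filter connects its full 16 x 16 x 3 input to its full 4 x 4 x 24 output; the second layer is left out because the number of its receptive fields is not stated explicitly, so the two counts shown are partial and are not expected to add up to the 15 billion total.

```python
def conv_grid(image_size, window, stride):
    # Number of window positions along one image dimension (no padding).
    return (image_size - window) // stride + 1

image_size = 300                       # images are resized to 300 x 300
grid = conv_grid(image_size, window=16, stride=4)
n_fields = grid * grid                 # untied receptive fields in layer 1

# Assumption: each layer-1 field maps 16x16x3 inputs fully to 4x4x24 outputs.
l1_in = 16 * 16 * 3
l1_out = 4 * 4 * 24
l1_weights = n_fields * l1_in * l1_out

# Layer 3 is dense: all 62x62x24 inputs connect to 4096 top-level neurons.
l3_weights = 62 * 62 * 24 * 4096
```

Under these assumptions the first layer has a 72 x 72 grid of fields (5184 in total, matching the filter count in the text) and roughly 1.5 billion weights, while the dense third layer contributes roughly 0.38 billion.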
^2 Arranged in a 72 x 72 grid
^3 Arranged in a 4 x 4 grid

Fig. 1. Examples of YFCC data and the associated metadata. Photo credits to Yahoo! users “Dougtone”, “ascaro41”, “mlaaker”, “Ingy The
Wingy”, “monoprixgourmet_bis”.

Training data is arranged into 99,207 data blocks of 960 images.
Each data block consists of 5 mini-batches, where each mini-batch
contains 192 images. Due to the scale of the data, the proposed
algorithm reduces training time by employing a pipeline technique
where the next layer begins training before the previous layer has
finished. Analogous to the example shown in Fig. 3, after a layer
L has trained an initial set of data blocks (in our case, 1000), the
next layer, L + 1, starts training. To accomplish this, two instances
of the layer L are run simultaneously: one which continues training
and one that uses up-to-date parameters to forward propagate data
from Block 0 to the layer L + 1. The parameters of the
forward-propagating layer L instance are periodically synchronized with the
layer L instance that continued training. We observed that our model
was not sensitive to the choice of synchronization frequency. As a
rule of thumb, we wait to train layer L + 1 until the objective of layer
L stabilizes, which typically occurs after approximately one million
images.
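
The effect of the pipelined schedule can be illustrated with a toy timing model. This is a sketch under strong simplifying assumptions that are not claims about the actual system: every layer takes one time step per data block, all layers can run concurrently, and the forward-propagating copy refreshes its parameters every `sync_every` steps. The function names are hypothetical.

```python
def pipeline_schedule(n_blocks, n_layers=3, warmup=1000):
    """Return (start, end) time steps per layer for pipelined training.

    One time step is one data block processed by every active layer.
    Layer l + 1 starts once layer l has trained `warmup` blocks.
    """
    schedule = []
    for layer in range(n_layers):
        start = layer * warmup
        schedule.append((start, start + n_blocks))
    return schedule

def sequential_steps(n_blocks, n_layers=3):
    # Baseline: each layer waits for the previous one to finish all blocks.
    return n_layers * n_blocks

def forward_param_versions(n_steps, sync_every):
    # Parameter version (training step) held by the forward-propagating
    # copy of layer L at each step, given periodic synchronization.
    return [(step // sync_every) * sync_every for step in range(n_steps)]

sched = pipeline_schedule(n_blocks=99_207)
pipelined_steps = sched[-1][1]
baseline_steps = sequential_steps(99_207)
```

In this idealized model the staleness of the forward-propagating copy is bounded by `sync_every - 1` steps, which gives one way to reason about the reported insensitivity to the synchronization frequency.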

3.2. HPC Architecture 

To train the neural network at scale, we used 98 nodes of the Edge
HPC cluster at Lawrence Livermore National Laboratory. The Edge
cluster consists of 206 nodes with 12-core Intel Xeon EP X5660
running at 2.8 GHz. Each node has 96 GB of DRAM and a Tesla
M2050 (Fermi) NVIDIA GPU with 3 GB of GDDR5. The training
algorithm is model parallel as described in m, with the nodes and
GPUs processing each mini-batch across the system and distributing
the model across the GPUs. Communication was provided by
MPI over Mellanox QDR Infiniband cards. The GPU accelerators
were used with CUDA 5.5 and MPI-direct communication and the
operating system was a 2.6.32 kernel RHEL 6 derivative.

The dataset was stored in a Lustre file system with a peak bandwidth
of 10 GB/s. Each mini-batch was copied from Lustre into
memory and then streamed into the GPU’s memory. Each GPU is
responsible for computing its section of the model parameters for
the current mini-batch. Communication within the algorithm occurs
when a layer’s input (or output) field spans multiple GPUs. The
communication is handled by a distributed array data structure (using
MPI) within the training algorithm. Global communication is
minimized by using untied local receptive fields, and allowing receptive
fields to be trained independently.
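
To make the idea of distributing untied receptive fields concrete, the sketch below splits the 5184 first-layer fields across 98 GPUs in contiguous, near-equal chunks. The partitioning scheme and the `partition` helper are assumptions for illustration; the text does not specify how fields are actually assigned to GPUs.

```python
def partition(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal chunks."""
    base, extra = divmod(n_items, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional item each.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = partition(5184, 98)           # layer-1 fields over 98 GPUs
sizes = [len(c) for c in chunks]
```

With untied weights each field's parameters live on exactly one GPU, so under a scheme like this only the fields whose inputs or outputs cross a chunk boundary would need data from a neighboring GPU.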

4. PRELIMINARY RESULTS

Fig. 4. Visualization of a selection of typical first layer weights. The
right figure is a zoomed-in crop of the left.

We trained the network using all images from the YFCC100M
dataset. Images were preprocessed as in [2], and subsequently resized
to 300 x 300 pixels by first centering, then scaling the smallest
dimension to 300 pixels, and finally cropping. After training all three
layers, we forward propagated 2 million images through the network
in order to obtain activation values for visualization. Note that in this
paper, the test set is significantly noisier than the benchmark Labeled
Faces In the Wild d and ImageNet im datasets considered in previous
works such as O.

In Fig. 5, we show the top 5 stimuli for some example neurons.
We observe that our network is capable of learning significant structure,
identifying buildings, aircraft, text, cityscapes, and tower-like
buildings, among many others. The network seems to cue in on distinctive
textures such as the edges of text, sides of buildings and
the sharp edge of airplanes against the smooth gradation of the sky.
Moreover, the network seems to activate on large-scale structures
within an image rather than local features. We believe that a significant
contributor to our network’s performance is its large size,
which allows it to capture complex concepts.

Fig. 5. Top-5 stimuli of example layer 3 neurons. Images have been
whitened.

While our results are encouraging, we believe that significant
improvements in learning can be achieved through improved network
architecture and increased depth. As was demonstrated in fT],
network architecture has a significant impact on the performance of
deep networks. While the networks described in O were able to
learn complex features in just three layers, our results suggest that
extremely large datasets such as the YFCC100M can support (and
possibly benefit from) deeper networks with improved high-level
concept learning.

5. SUMMARY AND FUTURE WORK

The results discussed in this paper present a snapshot of the work
in progress at Lawrence Livermore National Laboratory in scaling
up deep neural networks. Such networks offer enormous potential
to researchers in both supervised and unsupervised computer vision
tasks, from object recognition and classification to unsupervised feature
extraction.

To date, we see highly encouraging results from training
our large 15 billion parameter three-layer neural network on the
YFCC100M dataset in an unsupervised manner. The results suggest
that the network is capable of learning highly complex concepts
such as cityscapes, aircraft, buildings, and text, all without labels
or other guidance. That this structure is visible upon examination
is made all the more remarkable by the noisiness of our test set
(taken at random from the YFCC100M dataset itself).

Future work on our networks will focus on two main thrusts: (1)
improving the high-level concept learning by increasing the depth of
our network, and (2) scaling our network’s width in the middle layers.
On the first thrust, we aim for improved high-level summarization
and scene understanding. Challenges on this front include careful
tuning of parameters to combat the “vanishing gradient” problem
and design of the connectivity structure of the higher-level layers to
maximize learning. On the second thrust, our challenges are primarily
engineering-focused. Memory and message passing constraints
become a serious concern, even on the large HPC systems fielded
by LLNL. As we move beyond our current large neural network, we
plan to explore the use of memory hierarchies for staging intermediate/input
data to minimize the amount of node-to-node communication,
enabling the efficient training and analysis of even larger
networks.
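
The evaluation procedure of Section 4 (scale the smallest image dimension to 300 pixels, crop to 300 x 300, forward propagate, then rank images by activation) can be sketched as follows. Both helpers are hypothetical illustrations: the crop geometry is one reading of "centering, scaling, and cropping", and the toy activation matrix stands in for real network outputs.

```python
import numpy as np

def resize_then_crop_box(width, height, target=300):
    """Geometry of 'scale the smallest dimension to target, center crop'.

    Returns the scaled (width, height) and the crop box
    (left, top, right, bottom) within the scaled image.
    """
    scale = target / min(width, height)
    new_w = max(target, round(width * scale))
    new_h = max(target, round(height * scale))
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return (new_w, new_h), (left, top, left + target, top + target)

def top_k_stimuli(activations, k=5):
    """Indices of the k images that most strongly activate each neuron.

    activations: array of shape (n_images, n_neurons).
    Returns an array of shape (n_neurons, k), strongest image first.
    """
    order = np.argsort(-activations, axis=0)   # descending per neuron
    return order[:k].T

size, box = resize_then_crop_box(500, 375)
acts = np.array([[0.1, 0.9],
                 [0.7, 0.2],
                 [0.4, 0.8]])                  # 3 toy images, 2 toy neurons
top = top_k_stimuli(acts, k=2)
```

For millions of images a full sort per neuron is wasteful; a partial selection such as `np.argpartition` followed by sorting only the k retained entries would be a natural refinement.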